2 Quantization of Neural Networks
Quantization is a strategy that has demonstrated outstanding and consistent success in both the training and inference of neural networks (NNs). Although the issues of numerical representation and quantization are as old as digital computing, NNs present unique opportunities for advancement. While most of this quantization survey is concerned with inference, it is essential to note that quantization has also been successful in NN training [8, 42, 63, 105]. In particular, innovations in half-precision and mixed-precision training [47, 80] have enabled greater throughput in AI accelerators. However, going below half-precision without significant tuning has proven challenging, and most recent quantization research has therefore concentrated on inference.
2.1 Overview of Quantization
Given an NN model of $N$ layers, we denote its weight set as $\mathbf{W} = \{\mathbf{w}^n\}_{n=1}^{N}$ and the input feature set as $\mathbf{A} = \{\mathbf{a}^n_{\text{in}}\}_{n=1}^{N}$. Here, $\mathbf{w}^n \in \mathbb{R}^{C^n_{\text{out}} \times C^n_{\text{in}}}$ and $\mathbf{a}^n_{\text{in}} \in \mathbb{R}^{C^n_{\text{in}}}$ are the convolutional weight and the input feature map of the $n$-th layer, respectively, where $C^n_{\text{in}}$ and $C^n_{\text{out}}$ respectively stand for the number of input channels and the number of output channels. The outputs $\mathbf{a}^n_{\text{out}}$ can then be formulated as
$$\mathbf{a}^n_{\text{out}} = \mathbf{w}^n \cdot \mathbf{a}^n_{\text{in}}, \qquad (2.1)$$
where $\cdot$ represents matrix multiplication. In this chapter, we omit the non-linear function for simplicity.
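As a concrete reference point, the following NumPy sketch evaluates Eq. (2.1) for a single layer; the channel sizes and random values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch of Eq. (2.1): one layer's output as a matrix product of the
# weight w^n (C_out x C_in) with the input feature a_in (C_in). The shapes
# and random values below are illustrative assumptions.
rng = np.random.default_rng(0)
C_in, C_out = 4, 3
w_n = rng.standard_normal((C_out, C_in))  # full-precision weights
a_in = rng.standard_normal(C_in)          # full-precision input feature map
a_out = w_n @ a_in                        # Eq. (2.1); non-linearity omitted
print(a_out.shape)                        # (3,)
```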
Following prior works [100], a quantized neural network (QNN) aims to represent $\mathbf{w}^n$ and $\mathbf{a}^n$ in a low-bit format drawn from a quantization set
$$\mathcal{Q} := \{q_1, \cdots, q_U\},$$
where the $q_i$, $i = 1, \cdots, U$, satisfying $q_1 < \cdots < q_U$, are defined as the quantized values of the original variable $x$; note that $x$ can be either the input feature $\mathbf{a}^n$ or the weights $\mathbf{w}^n$. In this way, $\mathbf{q}_{\mathbf{w}^n} \in \mathcal{Q}^{C^n_{\text{out}} \times C^n_{\text{in}}}$ and $\mathbf{q}_{\mathbf{a}^n_{\text{in}}} \in \mathcal{Q}^{C^n_{\text{in}}}$, such that the floating-point convolutional outputs can be approximated by the efficient XNOR and bit-count instructions as
$$\mathbf{a}^n_{\text{out}} \approx \mathbf{q}_{\mathbf{w}^n} \odot \mathbf{q}_{\mathbf{a}^n_{\text{in}}}. \qquad (2.2)$$
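To see why Eq. (2.2) reduces to XNOR and bit-count instructions, consider the 1-bit case $\mathcal{Q} = \{-1, +1\}$: encoding $+1$ as bit 1 and $-1$ as bit 0, each position where two bit-vectors agree contributes $+1$ to their dot product and each disagreement contributes $-1$, so the dot product of length-$L$ vectors equals $2 \cdot \text{popcount}(\text{XNOR}) - L$. The NumPy sketch below checks this identity; the sign binarizer and the vector length are illustrative assumptions.

```python
import numpy as np

def binarize(x):
    """Sign quantization to Q = {-1, +1}, a common binary quantization set."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_dot(qx, qy):
    """Dot product of two {-1, +1} vectors via XNOR and bit-count.

    Encoding +1 as bit 1 and -1 as bit 0, agreements contribute +1 and
    disagreements -1, so dot = 2 * popcount(XNOR) - L.
    """
    bx, by = (qx > 0), (qy > 0)
    matches = np.count_nonzero(~(bx ^ by))  # XNOR, then bit-count
    return 2 * matches - len(qx)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(64), rng.standard_normal(64)
qx, qy = binarize(x), binarize(y)
assert xnor_dot(qx, qy) == int(qx @ qy)  # matches the integer dot product
```

On real hardware the bit vectors would be packed into machine words, so a single XNOR plus one popcount instruction covers 32 or 64 positions at once, which is the source of the efficiency gain.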
The core issue in QNNs is how to define the quantization set $\mathcal{Q}$, which we describe next.
2.1.1 Uniform and Non-Uniform Quantization
First, we must define a function capable of quantizing the weights and activations of the NN to a finite set of values. A popular choice for the quantization function is
$$q_x = \text{INT}\!\left(\frac{x}{S}\right) - Z, \qquad (2.3)$$
where $S$ is a real-valued scale factor and $Z$ is an integer zero point.
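Below is a minimal sketch of this uniform quantizer together with the matching dequantization step $\hat{x} = S(q_x + Z)$; the round-to-nearest choice for INT, the 8-bit clipping range, and the example values of $S$ and $Z$ are assumptions for illustration.

```python
import numpy as np

def quantize(x, S, Z, num_bits=8):
    """Uniform quantization per Eq. (2.3): q_x = INT(x / S) - Z.

    INT is taken as round-to-nearest, and the result is clipped to the
    signed integer range; both are common conventions assumed here.
    """
    q = np.round(x / S) - Z
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, S, Z):
    """Approximate recovery of the real value: x_hat = S * (q + Z)."""
    return S * (q.astype(np.float32) + Z)

x = np.array([-1.0, -0.25, 0.0, 0.4, 1.0], dtype=np.float32)
S, Z = 1.0 / 127, 0  # illustrative scale and zero point
q = quantize(x, S, Z)
print(q, dequantize(q, S, Z))  # integers and their dequantized approximations
```

Because the resulting quantized values are spaced a uniform step $S$ apart, this scheme is called uniform quantization; non-uniform schemes instead place the $q_i$ unevenly.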